-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consistent get_<DATA_ASSET> permissions - S3_Datasets #1727
Conversation
backend/dataall/modules/s3_datasets/services/dataset_location_service.py
Outdated
Show resolved
Hide resolved
backend/dataall/modules/s3_datasets/services/dataset_location_service.py
Outdated
Show resolved
Hide resolved
851f36a
to
85f5fc4
Compare
backend/dataall/modules/s3_datasets/services/dataset_service.py
Outdated
Show resolved
Hide resolved
backend/dataall/modules/s3_datasets/services/dataset_location_service.py
Outdated
Show resolved
Hide resolved
backend/dataall/modules/s3_datasets/services/dataset_location_service.py
Outdated
Show resolved
Hide resolved
backend/dataall/modules/s3_datasets/services/dataset_table_service.py
Outdated
Show resolved
Hide resolved
gql.Field(name='S3BucketName', type=gql.String), | ||
gql.Field(name='GlueDatabaseName', type=gql.String), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: A call out that these 2 fields (S3BucketName
and GlueDatabaseName
) are a part of index in opensearch (backend/dataall/modules/s3_datasets/indexers/dataset_indexer.py
) - not sure really how "restricted" they are treated if that information can be found elsewhere
gql.Field(name='IAMDatasetAdminRoleArn', type=gql.String), | ||
gql.Field(name='KmsAlias', type=gql.String), | ||
gql.Field(name='bucketCreated', type=gql.Boolean), | ||
gql.Field(name='glueDatabaseCreated', type=gql.Boolean), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: making note we can remove the following from data model if we wanted to clean up sometime in the future
- bucketCreated
- glueDatabaseCreated
- iamAdminRoleCreated
- lakeformationLocationCreated
- bucketPolicyCreated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not want to add a migration script because backfilling it in 2.6.2 was going to be a pain; but totally agree on the cleanup
fields=[ | ||
gql.Field(name='AwsAccountId', type=gql.String), | ||
gql.Field(name='GlueDatabaseName', type=gql.String), | ||
gql.Field(name='GlueTableName', type=gql.String), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar to call out above that GlueDatabaseName
is a part of index in opensearch (backend/dataall/modules/s3_datasets/indexers/table_indexer.py
) - not sure really how "restricted" they are treated if that information can be found elsewhere
++ to note I would think best if anything would be to restrict on open search and keep the other server side restricted logic the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might be the solution, restricting it also in OpenSearch
@@ -233,7 +233,7 @@ const FolderView = () => { | |||
const fetchItem = useCallback(async () => { | |||
setLoading(true); | |||
const response = await client.query(getDatasetStorageLocation(params.uri)); | |||
if (!response.errors && response.data.getDatasetStorageLocation !== null) { | |||
if (response.data.getDatasetStorageLocation !== null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: update the error dispatch to dispatch all errors (1 or more)
@@ -69,7 +70,7 @@ export const DatasetFolders = (props) => { | |||
const response = await client.query( | |||
listDatasetStorageLocations(dataset.datasetUri, filter) | |||
); | |||
if (!response.errors) { | |||
if (response.data.getDataset != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: dispatch all errors
gql.Field(name='importedS3Bucket', type=gql.Boolean), | ||
gql.Field(name='importedGlueDatabase', type=gql.Boolean), | ||
gql.Field(name='importedKmsKey', type=gql.Boolean), | ||
gql.Field(name='importedAdminRole', type=gql.Boolean), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did we also want to include the following share expiration / config related information as part of restricted?
gql.Field(name='autoApprovalEnabled', type=gql.Boolean),
gql.Field(name='enableExpiration', type=gql.Boolean),
gql.Field(name='expirySetting', type=gql.String),
gql.Field(name='expiryMinDuration', type=gql.Integer),
gql.Field(name='expiryMaxDuration', type=gql.Integer),
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially yes, but this information is fetched when opening a share request. At the end the user needs it for the share request creation
@@ -161,7 +161,7 @@ const DatasetView = () => { | |||
const fetchItem = useCallback(async () => { | |||
setLoading(true); | |||
const response = await client.query(getDataset(params.uri)); | |||
if (!response.errors && response.data.getDataset !== null) { | |||
if (response.data.getDataset !== null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: expose all errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as with Dashboards. I don't know if adding more error banners is going to give users more useful info. If the top level query fails, then the errors that we might find in the nested resolvers are very unpredictable. For example if get_Dataset fails because the dataset is not found, the restricted is going to fail with something as 'URI is null'; which might be misleading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it guaranteed the first error in the dispatch always the top level?
@@ -222,7 +222,7 @@ const TableView = () => { | |||
const fetchItem = useCallback(async () => { | |||
setLoading(true); | |||
const response = await client.query(getDatasetTable(params.uri)); | |||
if (!response.errors && response.data.getDatasetTable !== null) { | |||
if (response.data.getDatasetTable !== null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: expose all errors
dataset = dataset_fixture | ||
dataset.GlueDatabaseName = dataset_fixture.restricted.GlueDatabaseName | ||
dataset.region = dataset_fixture.restricted.region | ||
dataset.S3BucketName = dataset_fixture.restricted.S3BucketName | ||
dataset.AwsAccountId = dataset_fixture.restricted.AwsAccountId | ||
table1 = table(dataset=dataset, name='table1', username=user.username) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: any reason handled this way here and updated def table
factory for a similar/same required change at tests/modules/s3_datasets/conftest.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not very elegant, agreed. The issue is that the dataset_fixture
parameter takes in this case the value of the gql Dataset type. While the table factory accepts db S3Dataset model
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I follow - is it not the exact same dataset and table factories used in tests/modules/s3_datasets/conftest.py
but in that conftest we define Table factor as
@pytest.fixture(scope='module', autouse=True)
def table(db):
cache = {}
def factory(dataset: S3Dataset, name, username) -> DatasetTable:
key = f'{dataset.datasetUri}-{name}'
if cache.get(key):
return cache.get(key)
with db.scoped_session() as session:
table = DatasetTable(
name=name,
label=name,
owner=username,
datasetUri=dataset.datasetUri,
GlueDatabaseName=dataset.restricted.GlueDatabaseName,
GlueTableName=name,
region=dataset.restricted.region,
AWSAccountId=dataset.restricted.AwsAccountId,
S3BucketName=dataset.restricted.S3BucketName,
S3Prefix=f'{name}',
)
...
(note the way we set GlueDatabaseName, region, S3BucketName, and S3BucketName)
All minor enhancmenets / nits - the only pending comment I have before approval is regarding restrictions to additional dataset config info (re: #1727 (comment)) cc: @dlpzx |
### Feature or Bugfix - Feature - Bugfix ### Detail This is a continuation of #1727 but for Dashboards. - removes GET_DASHBOARD permission check on get_dashboard - adds a restriction field in the Dashboard type with the restricted information - implement resolver to return the restricted information that checks GET_DASHBOARD permissions. Return defaults for users with no permissions - Adapt frontend to use the restricted fields In addition i made some fixes in the frontend, for example, the default view I changed it to Overview because it improves the experience of a data consumer that is exploring the dashboard metadata. I also removed the forever circular progress in the viewer page if there are errors in the loading of the reader url. ### Relates - #1727 ### Security Please answer the questions below briefly where applicable, or write `N/A`. Based on [OWASP 10](https://owasp.org/Top10/en/). - Does this PR introduce or modify any input fields or queries - this includes fetching data from storage outside the application (e.g. a database, an S3 bucket)? - Is the input sanitized? - What precautions are you taking before deserializing the data you consume? - Is injection prevented by parametrizing queries? - Have you ensured no `eval` or similar functions are used? - Does this PR introduce any functionality or component that requires authorization? - How have you ensured it respects the existing AuthN/AuthZ mechanisms? - Are you logging failed auth attempts? - Are you using or adding any cryptographic features? - Do you use a standard proven implementations? - Are the used keys controlled by the customer? Where are they stored? - Are you introducing any new policies/roles/users? - Have you used the least-privilege principle? How? By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
@dlpzx - I think will have to merge latest from OS, resolve conflicts, and make changes to Then good to go |
# Conflicts: # backend/dataall/modules/s3_datasets_shares/api/resolvers.py # frontend/src/modules/Folders/components/FolderOverview.js
- Feature - Bugfix This is a continuation of #1727 but for Dashboards. - removes GET_DASHBOARD permission check on get_dashboard - adds a restriction field in the Dashboard type with the restricted information - implement resolver to return the restricted information that checks GET_DASHBOARD permissions. Return defaults for users with no permissions - Adapt frontend to use the restricted fields In addition i made some fixes in the frontend, for example, the default view I changed it to Overview because it improves the experience of a data consumer that is exploring the dashboard metadata. I also removed the forever circular progress in the viewer page if there are errors in the loading of the reader url. - #1727 Please answer the questions below briefly where applicable, or write `N/A`. Based on [OWASP 10](https://owasp.org/Top10/en/). - Does this PR introduce or modify any input fields or queries - this includes fetching data from storage outside the application (e.g. a database, an S3 bucket)? - Is the input sanitized? - What precautions are you taking before deserializing the data you consume? - Is injection prevented by parametrizing queries? - Have you ensured no `eval` or similar functions are used? - Does this PR introduce any functionality or component that requires authorization? - How have you ensured it respects the existing AuthN/AuthZ mechanisms? - Are you logging failed auth attempts? - Are you using or adding any cryptographic features? - Do you use a standard proven implementations? - Are the used keys controlled by the customer? Where are they stored? - Are you introducing any new policies/roles/users? - Have you used the least-privilege principle? How? By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
- Feature/Bugfix This PR is the first part of a series and only handles S3_Datasets and its child components Tables and Folders Most API calls on a particular resource are restricted by GET_RESOURCE permissions. But for resources that are indexed in the Catalog, some metadata is considered public as it is useful for data consumers to discover and understand the data assets. Users will click on these resources from the Catalog view and call one of the following API calls: - getDataset - getDatasetStorageLocation - getDatasetTable - getDashboard - getRedshiftDataset - getRedshiftTable From the above list, initially all are decorated with resource_permission checks except for getDataset and getDatasetTable. - Public information should be available for data consumers to explore, that means that we first need to remove the resource_permission checks from the list of APIs. - Not all information is public, to get AWS information and other restricted metadata we still need to verify GET_X permissions on the resource. - For restricted metadata, we should provide default messages that do not break the frontend In addition in this PR some unused fields are removed and `syncTables` is simplified to return an integer with the count of synced tables For each of the following I tested with a user with GET permissions and without GET permissions. FE views render and show information or unauthorized to view info placeholder - [X] Dataset View, Dataset Edit Form and list Datasets - [x] Dataset Data tab with list of tables and folders - [X] Table view, Table Edit - [X] Folder view and Folder Edit Other checks - [x] Complete share request - [X] With requester check table and folder and view the details of the account... - [X] Worksheets query with owned table - [X] Worksheets query with shared table - [X] Crawl dataset - correct S3 path - [X] Sync tables - correct params Please answer the questions below briefly where applicable, or write `N/A`. Based on [OWASP 10](https://owasp.org/Top10/en/). - Does this PR introduce or modify any input fields or queries - this includes fetching data from storage outside the application (e.g. a database, an S3 bucket)? - Is the input sanitized? - What precautions are you taking before deserializing the data you consume? - Is injection prevented by parametrizing queries? - Have you ensured no `eval` or similar functions are used? - Does this PR introduce any functionality or component that requires authorization? - How have you ensured it respects the existing AuthN/AuthZ mechanisms? - Are you logging failed auth attempts? - Are you using or adding any cryptographic features? - Do you use a standard proven implementations? - Are the used keys controlled by the customer? Where are they stored? - Are you introducing any new policies/roles/users? - Have you used the least-privilege principle? How? By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Feature or Bugfix
Detail
This PR is the first part of a series and only handles S3_Datasets and its child components Tables and Folders
Current implementation
Most API calls on a particular resource are restricted by GET_RESOURCE permissions. But for resources that are indexed in the Catalog, some metadata is considered public as it is useful for data consumers to discover and understand the data assets. Users will click on these resources from the Catalog view and call one of the following API calls:
From the above list, initially all are decorated with resource_permission checks except for getDataset and getDatasetTable.
Target implementation
In addition in this PR some unused fields are removed and
syncTables
is simplified to return an integer with the count of synced tablesRelates
Testing
For each of the following I tested with a user with GET permissions and without GET permissions. FE views render and show information or unauthorized to view info placeholder
Other checks
Security
Please answer the questions below briefly where applicable, or write
N/A
. Based onOWASP 10.
fetching data from storage outside the application (e.g. a database, an S3 bucket)?
eval
or similar functions are used?By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.